SUPPORT VECTOR MACHINES FOR CURRENT STATUS DATA


By Yael Travis-Lumer and Yair Goldberg 
University of Haifa 

Current status data is a data format where the time to event is
restricted to knowledge of whether or not the failure time exceeds a
random monitoring time. We develop a support vector machine learning
method for current status data that estimates the failure time
expectation as a function of the covariates. In order to obtain the
support vector machine decision function, we minimize a regularized
version of the empirical risk with respect to a data-dependent loss.
We show that the decision function has a closed form. Using finite
sample bounds and novel oracle inequalities, we prove that the obtained
decision function converges to the true conditional expectation
for a large family of probability measures and study the associated
learning rates. Finally, we present a simulation study that compares
the performance of the proposed approach to the current state of the art.


1. Introduction. In this paper we aim to develop a general, model-free method for analyzing current
status data using machine learning techniques. In particular, we propose a support vector machine (SVM)
learning method for estimation of the failure time expectation for current status data. SVM was originally
introduced by Vapnik in the 1990s and is closely related to statistical learning theory (Vapnik, 1999). The
choice of SVMs for current status data is motivated by the fact that SVMs can be implemented easily, have
fast training speed, produce decision functions that have a strong generalization ability, and can guarantee
convergence to the optimal solution under some weak assumptions (Shivaswamy et al., 2007).

Current status data is a data format where the failure time T is restricted to knowledge of whether or not T 
exceeds a random monitoring time C. This data format is quite common and includes examples from various 
fields. Jewell and van der Laan (2004) mention a few examples including: studying the distribution of the age of 
a child at weaning given observation points; when conducting a partner study of HIV infection over a number of 
clinic visits; and when a tumor under investigation is occult and an animal is sacrificed at a certain time point 
in order to determine presence or absence of the tumor. For instance, in the last example of carcinogenicity
testing, T is the time from exposure to a carcinogen until the presence of a tumor, and C is the time point
at which the animal is sacrificed in order to determine presence or absence of the tumor. Clearly, it is difficult
to estimate the failure time distribution since we cannot observe the failure time T. These examples illustrate
the importance of this topic and the need to find advanced tools for analyzing such data.

We present a support vector machine framework for current status data. We propose a learning method, 
denoted by CSD-SVM, for estimation of the failure time expectation. We investigate the theoretical properties 
of the CSD-SVM, and in particular, prove consistency for a large family of probability measures. In order to 
estimate the conditional expectation we use a modified version of the quadratic loss. Using the methodology of 
van der Laan and Robins (1998), we construct a data-dependent version of the quadratic loss. Since the failure
time T is not observed, our data-dependent loss function is based on the censoring time C and on the current
status indicator. Finally, in order to obtain a CSD-SVM decision function for current status data, we minimize 
a regularized version of the empirical risk with respect to this data-dependent loss. 

There are several approaches for analyzing current status data. Traditional methods include parametric 
models where the underlying distribution of the survival time is assumed to be known (such as Weibull, Gamma, 
and other distributions with non-negative support). Other approaches include semiparametric models, such as 
the Cox proportional hazard model, and the accelerated failure time (AFT) model (see, for example, Klein and 
Moeschberger, 2013). In the Cox model, the hazard function is assumed to be proportional to the exponent of 
a linear combination of the covariates. In the AFT model, the log of the failure time is assumed to be a linear 
function of the covariates. Several works, including Diamond et al. (1986), Jewell and van der Laan (2004) and
others, have suggested the Cox proportional hazard model for current status data, where the Cox model can be
represented as a generalized linear model with a log-log link function. Other works, including Tian and Cai (2006),
discussed the use of the AFT model for current status data and suggested different algorithms for estimating
the model parameters. Both parametric and semiparametric models demand stringent
assumptions on the distribution of interest, which can be restrictive. For this reason, additional estimation
methods are needed.

Over the past two decades, some learning algorithms for censored data have been proposed (such as neural 
networks and splitting trees), but mostly with no theoretical justification. Additionally, most of these algorithms 
cannot be applied to current status data but only to other, more common, censored data formats. Recently, 
several works suggested the use of SVMs for survival data. Van Belle et al. (2007) suggested the use of SVMs for 
survival analysis, and formulated the task as a ranking problem. Shortly after, Khan and Zubek (2008) suggested
the use of SVMs for regression problems with censored data; this was done by asymmetrically modifying the
$\varepsilon$-insensitive loss function. Both examples were empirically tested but lacked theoretical justification. Eleuteri and
Taktak (2012) proposed an empirical quantile risk estimator, which can also be applied to right censoring, and
studied the estimator's performance. Goldberg and Kosorok (2012) studied an SVM framework for right censored
data and proved that the algorithm converges to the optimal solution. Shiao and Cherkassky (2013) suggested
two SVM-based formulations for classification problems with survival data. These examples illustrate that initial
steps in this direction have already been taken. However, as far as we know, the only SVM-based work that
can also be applied to current status data is by Shivaswamy et al. (2007), which has a more computational and
less theoretical nature. The authors studied the use of SVM for regression problems with interval censoring and,
using simulations, showed that the method is comparable to other missing data tools and performs
well when the majority of the training data is censored.

The contribution of this work includes the development of an SVM framework for current status data, 
the study of the theoretical properties of the CSD-SVM, and the development of new oracle inequalities for 
censored data. These inequalities, together with finite sample bounds, allow us to prove consistency and to 
compute learning rates. 

The paper is organized as follows. In section 2 we describe the formal setting of current status data and discuss 
the choice of the quadratic loss for estimating the conditional expectation. In section 3 we present the proposed 
CSD-SVM and its corresponding data-dependent loss function. Section 4 contains the main theoretical results, 
including finite sample bounds, consistency proofs and learning rates. In section 5 we illustrate the estimation 
procedure and show that the solution has a closed form. Section 6 contains the simulations. Concluding remarks 
are presented in section 7. The lengthier proofs appear in Appendix A. The Matlab code for both the algorithm 
and for the simulations can be found in the Supplementary Material. 

2. Preliminaries. In this section we present the notation used throughout the paper. First we describe
the data setting and then we briefly discuss loss functions and risks.

Assume that the data consists of n i.i.d. random triplets $D = \{(Z_1, C_1, \Delta_1), \ldots, (Z_n, C_n, \Delta_n)\}$. The random
vector Z is a vector of covariates that takes its values in a compact set $\mathcal{Z} \subset \mathbb{R}^d$. The failure time T is
non-negative and is contained in the interval $[0, \tau] \equiv \mathcal{Y}$ for some constant $\tau > 0$, the random variable C is
the censoring time, and the indicator $\Delta = 1\{T \le C\}$ is the current status indicator at time C. For example, in
carcinogenicity testing, an animal is sacrificed at a certain time point in order to determine presence or absence
of the tumor. In this example, T is the time from exposure to a carcinogen until the presence of a tumor, C
is the time point at which the animal is sacrificed, and $\Delta$ is the current status indicator at time C (indicating
whether or not the tumor has developed before the censoring time).

We now move to discuss a few definitions of loss functions and risks, following Steinwart and Christmann
(2008). Let $(\mathcal{Z}, \mathcal{A})$ be a measurable space and $\mathcal{Y} \subset \mathbb{R}$ be a closed subset. Then a loss function is any measurable
function L from $\mathcal{Z} \times \mathcal{Y} \times \mathbb{R}$ to $[0, \infty)$.

Let $L : \mathcal{Z} \times \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function and P be a probability measure on $\mathcal{Z} \times \mathcal{Y}$. For a measurable
function $f : \mathcal{Z} \to \mathbb{R}$, the L-risk of f is defined by

$$ \mathcal{R}_{L,P}(f) \equiv E_P[L(Z, Y, f(Z))] = \int_{\mathcal{Z} \times \mathcal{Y}} L(z, y, f(z))\, dP(z, y). $$

A function f that achieves the minimum L-risk is called a Bayes decision function and is denoted by $f^*$, and
the minimal L-risk is called the Bayes risk and is denoted by $\mathcal{R}^*_{L,P}$. Finally, the empirical L-risk is defined by

$$ \mathcal{R}_{L,D}(f) = \frac{1}{n} \sum_{i=1}^{n} L(z_i, y_i, f(z_i)). $$

For example, it is well known (see, for example, Hastie et al., 2009) that the conditional expectation is the
Bayes decision function with respect to the quadratic loss.
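
As a small numerical illustration of the empirical L-risk and of the last remark, the following Python sketch (not part of the paper's Matlab code) evaluates the empirical quadratic risk of constant predictions on simulated responses. The exponential responses, the grid of candidate constants, and the function names are assumptions chosen only for illustration; with no covariates, the conditional expectation reduces to the mean.

```python
import numpy as np

def quadratic_loss(y, s):
    """Quadratic loss L(y, s) = (y - s)^2."""
    return (y - s) ** 2

def empirical_risk(loss, y, predictions):
    """Empirical L-risk: the average loss over the sample."""
    return float(np.mean(loss(y, predictions)))

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1000)  # illustrative responses, no covariates

# Evaluate the empirical risk of each constant prediction on a grid.
candidates = np.linspace(0.0, 5.0, 501)
risks = np.array([empirical_risk(quadratic_loss, y, c) for c in candidates])
best_constant = candidates[np.argmin(risks)]

# The best constant on the grid sits next to the sample mean.
print(best_constant, y.mean())
```

The grid search is deliberately naive; it only makes visible that the empirical quadratic risk is minimized at the sample mean.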

3. Support Vector Machines for Current Status Data. Let $\mathcal{H}$ be a reproducing kernel Hilbert space
(RKHS) of functions from $\mathcal{Z}$ to $\mathbb{R}$, where an RKHS is a function space that can be characterized by some kernel
function $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$. By definition, if k is a universal kernel, then $\mathcal{H}$ is dense in the space of continuous
functions on $\mathcal{Z}$, $C(\mathcal{Z})$ (see, for example, Steinwart and Christmann 2008, Definition 4.52). Let us fix such an
RKHS $\mathcal{H}$ and denote its norm by $\|\cdot\|_{\mathcal{H}}$, and let $\{\lambda_n\} > 0$ be some sequence of regularization constants. An SVM
decision function for uncensored data is defined by:

$$ f_{D,\lambda_n} = \operatorname{argmin}_{f \in \mathcal{H}}\ \lambda_n \|f\|^2_{\mathcal{H}} + \frac{1}{n} \sum_{i=1}^{n} L(Z_i, T_i, f(Z_i)). $$

We recall that current status data consists of n independent and identically distributed random triplets
$D = \{(Z_1, C_1, \Delta_1), \ldots, (Z_n, C_n, \Delta_n)\}$. Let $F(\cdot|Z = z)$ and $G(\cdot|Z = z)$ be the cumulative distribution functions of
the failure time and censoring, respectively, given the covariates $Z = z$. Let $g(\cdot|Z = z)$ be the density of
$G(\cdot|Z = z)$. For current status data, we introduce the following identity between risks, following van der Laan
and Robins (1998). We extend this notion and incorporate loss functions and covariates in the following identity.
Let $L : \mathcal{Y} \times \mathbb{R} \to [0, \infty)$ be a loss function differentiable in the first variable. Let $\ell : \mathcal{Y} \times \mathbb{R} \to \mathbb{R}$ be the derivative
of L with respect to the first variable.

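We would like to find the minimizer of $\mathcal{R}_{L,P}(f)$ over a set $\mathcal{H}$ of functions f. Note that

$$
\begin{aligned}
\mathcal{R}_{L,P}(f) \equiv E_Z E_{T|Z} L(T, f(Z))
&= E_Z\left[\int_0^\tau L(t, f(Z))\, dF(t|Z)\right] \\
&= E_Z\left[\int_0^\tau \ell(t, f(Z))(1 - F(t|Z))\, dt - L(t, f(Z))(1 - F(t|Z))\Big|_0^\tau\right] \\
&= E_Z\left[\int_0^\tau \ell(t, f(Z))(1 - F(t|Z))\, dt\right] + E[L(0, f(Z))],
\end{aligned}
$$

where the equality before last follows from integration by parts. Note also that $(1 - \Delta) = 1\{T > C\}$ and thus

$$
\begin{aligned}
E\left[\frac{(1 - \Delta)\,\ell(C, f(Z))}{g(C|Z)}\right]
&= E_{Z,T}\left[E_C\left[\left.\frac{1\{T > C\}\,\ell(C, f(Z))}{g(C|Z)}\,\right|\, Z = z, T = t\right]\right] \\
&= E_{Z,T}\left[\int_0^\tau \frac{1\{t > c\}\,\ell(c, f(z))\, g(c|z)}{g(c|z)}\, dc\right] \\
&= E_{Z,T}\left[\int_0^\tau 1\{t > c\}\,\ell(c, f(z))\, dc\right] \\
&= E_Z\left[\int_0^\tau \ell(c, f(z)) \int_0^\tau 1\{t > c\}\, dF(t|z)\, dc\right] \\
&= E_Z\left[\int_0^\tau \ell(c, f(z))(1 - F(c|z))\, dc\right].
\end{aligned}
$$

This shows that the risk can be represented as the sum of two terms,

$$ \mathcal{R}_{L,P}(f) = E\left[\frac{(1 - \Delta)\,\ell(C, f(Z))}{g(C|Z)}\right] + E[L(0, f(Z))]. $$

Hence, in order to estimate the minimizer of $\mathcal{R}_{L,P}(f)$, one can minimize a regularized version of the empirical risk with
respect to the data-dependent loss function

$$ L^n(D, (Z, C, \Delta, s)) = \frac{(1 - \Delta)\,\ell(C, s)}{g(C|Z)} + L(0, s). $$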

Note that this function need not be convex nor a loss function. For the quadratic loss function, our
data-dependent loss function becomes

$$ L^n(D, (Z, C, \Delta, s)) = \frac{(1 - \Delta)\, 2(C - s)}{g(C|Z)} + s^2. $$

Note that this function is convex but not necessarily a loss function since it can obtain negative values.
In order to ensure positivity we add a constant term that does not depend on f, and so our loss becomes
$\tilde{L}^n(D, (Z, C, \Delta, f(Z))) = \frac{(1 - \Delta)\, 2(C - f(Z))}{g(C|Z)} + (f(Z))^2 + a$, where for a fixed dataset of length n,
$a = \max_{1 \le i \le n} \left\{ \left( \frac{1 - \Delta_i}{g(C_i|Z_i)} \right)^2 \right\}$. Note that this additional term will not affect the optimization (since $\tilde{L}^n$ is just a shift by a
constant of $L^n$) and thus will be neglected hereafter.
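
The risk identity behind this loss can be checked by simulation. The Python sketch below is an illustration under stated assumptions, not the authors' code: the failure times are drawn from a hypothetical distribution on $[0, \tau]$, the monitoring times are uniform on $[0, \tau]$ (so that $g = 1/\tau$), there are no covariates, and the prediction s is an arbitrary constant. It compares the average of the data-dependent quadratic loss, which uses only $(C, \Delta)$, with the average of $(T - s)^2$, which uses the unobservable T.

```python
import numpy as np

def csd_quadratic_loss(delta, c, s, g_at_c):
    """Data-dependent quadratic loss: (1 - delta) * 2 * (c - s) / g(c|z) + s^2."""
    return (1.0 - delta) * 2.0 * (c - s) / g_at_c + s ** 2

rng = np.random.default_rng(1)
n, tau = 200_000, 1.0
t = rng.uniform(0.0, tau, size=n) ** 2   # hypothetical failure times in [0, tau]
c = rng.uniform(0.0, tau, size=n)        # monitoring times with density g = 1 / tau
delta = (t <= c).astype(float)           # current status indicator 1{T <= C}
s = 0.4                                  # an arbitrary fixed prediction

losses = csd_quadratic_loss(delta, c, s, 1.0 / tau)
oracle_risk = np.mean((t - s) ** 2)      # needs T, unavailable in practice
csd_risk = np.mean(losses)               # needs only (C, delta) and g

print(oracle_risk, csd_risk, losses.min())
```

Individual values of the data-dependent loss can be negative, which is why the constant a is added above, while the two averages should agree up to Monte Carlo error.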

In order to implement this result into the SVM framework, we propose to define the CSD-SVM decision
function for current status data by

$$ f_{D,\lambda} = \operatorname{argmin}_{f \in \mathcal{H}}\ \lambda \|f\|^2_{\mathcal{H}} + \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1 - \Delta_i)\, 2(C_i - f(Z_i))}{g(C_i|Z_i)} + (f(Z_i))^2 \right]. \qquad (1) $$

Note that if the censoring mechanism is not known, we can replace the density g with its estimate $\hat{g}$; in this
case our loss function becomes $L^n(D, (Z, C, \Delta, s)) = \frac{(1 - \Delta)\, 2(C - s)}{\hat{g}(C|Z)} + s^2$ and the SVM decision function is

$$ f_{D,\lambda} = \operatorname{argmin}_{f \in \mathcal{H}}\ \lambda \|f\|^2_{\mathcal{H}} + \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{(1 - \Delta_i)\, 2(C_i - f(Z_i))}{\hat{g}(C_i|Z_i)} + (f(Z_i))^2 \right] $$

(note the use of $\hat{g}$ instead of g in the denominator).

We note that for current status data, the assumption of some knowledge of the censoring distribution is
reasonable, for example, when it is chosen by the researcher (Jewell and van der Laan, 2004). In other cases, 
the density can be estimated using either parametric or nonparametric density estimation techniques such as 
kernel estimates. It should be noted that the censoring variable itself is not censored and thus simple density 
estimation techniques can be used in order to estimate the density g. 
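
As an illustration of the last point, the following Python sketch estimates the censoring density with a plain kernel density estimate, evaluated at the observed monitoring times. It is a minimal sketch under stated assumptions: the censoring distribution is taken not to depend on the covariates, the kernel is Gaussian, the monitoring times are simulated from a hypothetical normal distribution, and the bandwidth uses a common rule of thumb rather than any choice made in the paper.

```python
import numpy as np

def gaussian_kde_at(points, sample, bandwidth):
    """Kernel density estimate (1 / (n h)) * sum_i K((sample_i - x) / h), Gaussian K."""
    u = (sample[None, :] - points[:, None]) / bandwidth
    kernel_values = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel_values.mean(axis=1) / bandwidth

rng = np.random.default_rng(2)
c = rng.normal(loc=1.0, scale=0.25, size=2000)  # hypothetical monitoring times
h = 1.06 * c.std() * c.size ** (-1 / 5)         # rule-of-thumb bandwidth (assumption)

# Estimated density at each observed C_i; this replaces g(C_i|Z_i) in the loss.
g_hat = gaussian_kde_at(c, c, h)

# Sanity check: the estimate should integrate to about one.
grid = np.linspace(-1.0, 3.0, 2001)
mass = gaussian_kde_at(grid, c, h).sum() * (grid[1] - grid[0])
print(g_hat[:3], mass)
```

Because the monitoring times are fully observed, no censoring-specific adjustment is needed in this step.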


4. Theoretical Results. In this section we prove consistency of the CSD-SVM learning method for a large 
family of probability measures and construct learning rates. We first assume that the censoring mechanism is 
known, and thus the true density of the censoring variable g is known. Using this assumption, and some 
additional conditions, we bound the difference between the risk of the CSD-SVM decision function and the 
Bayes risk in order to form finite sample bounds. We use this result, together with oracle inequalities, to show 
that the CSD-SVM converges in probability to the Bayes risk. That is, we demonstrate that for a very large 
family of probability measures, the CSD-SVM learning method is consistent. We then consider the case in which 
the censoring mechanism is not known and thus the density g needs to be estimated. We estimate the density g 
using nonparametric kernel density estimation and develop a novel finite sample bound. We use this bound to 
prove that the CSD-SVM is consistent even when the censoring distribution is not known. Finally, we construct
learning rates for the CSD-SVM learning method for both g known and unknown.


Definition 1. Let $L(y, s) = \frac{(y - s)^2}{\tau^2}$ be the normalized quadratic loss, let $\ell(y, s) = \frac{2(y - s)}{\tau^2}$ be its derivative with
respect to the first variable, and let $L^n(D, (Z, C, \Delta, s)) = \frac{(1 - \Delta)\,\ell(C, s)}{g(C|Z)} + L(0, s)$ be the data-dependent version
of this loss.

For simplicity, we use the normalized version of the quadratic loss.

Since both L and $\ell$ are convex functions with respect to s, then for any compact set $\mathcal{S} = [-S, S] \subset \mathbb{R}$, both
L and $\ell$ are bounded and Lipschitz continuous with constants $c_L$ and $c_\ell$ that depend on S.

Remark 1. $L(y, 0) \le 1$ for all $y \in \mathcal{Y}$ and $\ell(y, s) \le B_1$ for all $(y, s) \in \mathcal{Y} \times \mathcal{S}$ and for some constant $B_1 > 0$.

We need the following assumptions:

(A1) The censoring time C is independent of the failure time T given Z.

(A2) C takes its values in the interval $[0, \tau]$ and $\inf_{z \in \mathcal{Z},\, c \in [0, \tau]} g(c|z) \ge 2K > 0$, for some $K > 0$.

(A3) $\mathcal{Z} \subset \mathbb{R}^d$ is compact.

(A4) $\mathcal{H}$ is an RKHS of a continuous kernel k with $\|k\|_\infty \le 1$.

Define the approximation error by $A_2(\lambda) = \inf_{f \in \mathcal{H}} \lambda \|f\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f) - \mathcal{R}^*_{L,P}$.

Define $B_2 = c_L \lambda^{-1/2} + 1$ and $B = \frac{B_1}{2K} + B_2$, where $B_1$ is defined in Remark 1.

4.1. Case I - The Censoring Density g is Known. In this section we develop finite sample bounds assuming 
that the censoring density g is known. 


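Theorem 1. Assume that (A1)-(A4) hold. Then for fixed $\lambda > 0$, $n \ge 1$, $\varepsilon > 0$, and $\theta > 0$, with probability
not less than $1 - e^{-\theta}$,

$$ \lambda \|f_{D,\lambda}\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} - A_2(\lambda) < B \sqrt{\frac{2 \log\left(2 N(\lambda^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)\right) + 2\theta}{n}} + \frac{2 c_\ell \varepsilon}{K} + 4 c_L \varepsilon, $$

where $N(\lambda^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)$ is the covering number of the $\varepsilon$-net of $\lambda^{-1/2} B_{\mathcal{H}}$ with respect to the supremum norm
and where $B_{\mathcal{H}}$ is the unit ball of $\mathcal{H}$ (for further details see Steinwart and Christmann 2008).

The proof of this theorem appears in Appendix A.1.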

We now move to discuss consistency of the CSD-SVM learning method. By definition, consistency with respect
to a probability measure P means that for any $\varepsilon > 0$,

$$ \lim_{n \to \infty} P\left(D \in (\mathcal{Z} \times \mathcal{Y})^n : \mathcal{R}_{L,P}(f_{D,\lambda_n}) \le \mathcal{R}^*_{L,P} + \varepsilon\right) = 1, \qquad (2) $$

where $\mathcal{R}^*_{L,P}$ is the Bayes risk. Universal consistency means that (2) holds for all probability measures P on
$\mathcal{Z} \times \mathcal{Y}$. However, in survival analysis we have the problem of identifiability and thus we will limit our discussion
to probability measures that satisfy some identification conditions. Let $\mathcal{P}$ be the set of all probability measures
that satisfy assumptions (A1)-(A2). We say that a learning method is $\mathcal{P}$-universal consistent when (2) holds for
all probability measures $P \in \mathcal{P}$.

In order to show $\mathcal{P}$-universal consistency, we utilize the finite sample bounds of Theorem 1. The following
assumption is also needed for proving $\mathcal{P}$-universal consistency:

(A5) $\inf_{f \in \mathcal{H}} \mathcal{R}_{L,P}(f) = \mathcal{R}^*_{L,P}$, for all probability measures P on $\mathcal{Z} \times \mathcal{Y}$.

Assumption (A5) means that our RKHS $\mathcal{H}$ is rich enough to include the Bayes decision function.

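Corollary 1. Assume the setting of Theorem 1 and that Assumption (A5) holds. Let $\lambda_n$ be a sequence such
that $\lambda_n \to 0$ and $\lambda_n n \to \infty$ as $n \to \infty$. Choose $\varepsilon = n^{-p}$, for some $p > 0$. Then the CSD-SVM learning method is
$\mathcal{P}$-universal consistent.

Proof. By Theorem 1, for fixed $\lambda > 0$, $n \ge 1$, $\varepsilon > 0$, and $\theta > 0$,

$$ \lambda \|f_{D,\lambda}\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} - A_2(\lambda) \ge B \sqrt{\frac{2 \log\left(2 N(\lambda^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)\right) + 2\theta}{n}} + \frac{2 c_\ell \varepsilon}{K} + 4 c_L \varepsilon, $$

with probability not greater than $e^{-\theta}$.

Choose $\lambda = \lambda_n$; from Assumption (A5) together with Lemma 5.15 of Steinwart and Christmann (2008),
$A_2(\lambda_n)$ converges to zero as n converges to infinity. Clearly

$$ B \sqrt{\frac{2 \log\left(2 N(\lambda_n^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)\right) + 2\theta}{n}} \to 0. $$

Finally, from the choice of $\varepsilon$, it follows that both $\frac{2 c_\ell \varepsilon}{K}$ and $4 c_L \varepsilon$ converge to 0 as $n \to \infty$. Hence for every fixed
$\theta$,

$$ \lambda_n \|f_{D,\lambda_n}\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f_{D,\lambda_n}) - \mathcal{R}^*_{L,P} \le A_2(\lambda_n) + B \sqrt{\frac{2 \log\left(2 N(\lambda_n^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)\right) + 2\theta}{n}} + \frac{2 c_\ell \varepsilon}{K} + 4 c_L \varepsilon $$

with probability not less than $1 - e^{-\theta}$. The right hand side of this converges to 0 as $n \to \infty$, which implies
consistency (Steinwart and Christmann, 2008, Lemma 6.5). Since this holds for all probability measures $P \in \mathcal{P}$,
we obtain $\mathcal{P}$-universal consistency. □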


4.2. Case II - The Censoring Density g is Unknown. In this section we form finite sample bounds for the
case in which the censoring density is not known and needs to be estimated. We assume that the density of the
censoring variable is estimated using nonparametric kernel density estimation. In Lemma 1 we construct finite
sample bounds on the difference between the estimated density $\hat{g}$ and the true density g. In Theorem 2 we utilize
this bound to form finite sample bounds for the CSD-SVM learning method.

Definition 2. We say that $\mathcal{K} : \mathbb{R} \to \mathbb{R}$ (not to be confused with the kernel function k of the RKHS $\mathcal{H}$) is a
kernel of order m, if the functions $u \mapsto u^j \mathcal{K}(u)$, $j = 0, 1, \ldots, m$ are integrable and satisfy $\int_{-\infty}^{\infty} \mathcal{K}(u)\, du = 1$ and
$\int_{-\infty}^{\infty} u^j \mathcal{K}(u)\, du = 0$, $j = 1, \ldots, m$.

Definition 3. The Hölder class $\Sigma(\beta, \mathcal{L})$ of functions $f : \mathbb{R} \to \mathbb{R}$ is the set of $m = \lfloor \beta \rfloor$ times differentiable
functions whose derivative $f^{(m)}$ satisfies $|f^{(m)}(x) - f^{(m)}(x')| \le \mathcal{L} |x - x'|^{\beta - m}$ for some constant $\mathcal{L} > 0$.

Lemma 1. Let $\mathcal{K} : \mathbb{R} \to \mathbb{R}$ be a kernel function of order m satisfying $\int \mathcal{K}^2(u)\, du < \infty$ and define
$\hat{g}(x) = \frac{1}{nh} \sum_{i=1}^{n} \mathcal{K}\left(\frac{C_i - x}{h}\right)$, where h is the bandwidth. Suppose that the true density g satisfies $g(c) \le g_{max} < \infty$. Let us
also assume that $g(c)$ belongs to the Hölder class $\Sigma(\beta, \mathcal{L})$. Finally, assume that $\int |u|^\beta |\mathcal{K}(u)|\, du < \infty$. Then
for any $\varepsilon > 0$,

$$ \Pr\left( \frac{1}{n} \sum_{i=1}^{n} |\hat{g}(c_i) - g(c_i)| > \varepsilon + C_2 h^\beta \right) \le \sqrt{\frac{C_1}{n h \varepsilon^2}}, $$

where $C_1 = g_{max} \int \mathcal{K}^2(v)\, dv$ and $C_2 = \frac{\mathcal{L} \pi^{\beta - m}}{m!} \int |\mathcal{K}(u)| |u|^\beta\, du$ are constants, and for some $\pi \in [0, 1]$.

The proof of the lemma is based on Tsybakov (2008, Propositions 1.1 and 1.2) together with basic
concentration inequalities; the proof can be found in Appendix A.2.

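We would like to choose h that minimizes the sum of $C_2 h^\beta$ and $\sqrt{\frac{C_1}{n h \varepsilon^2}}$. Define $U(h) = C_2 h^\beta + \sqrt{\frac{C_1}{n h \varepsilon^2}}$.
Taking the derivative of U with respect to h and setting it to zero yields:

$$ \frac{dU}{dh} = \beta C_2 h^{\beta - 1} - \frac{1}{2} \sqrt{\frac{C_1}{n \varepsilon^2}}\, h^{-3/2} = 0 \iff h = \left( \frac{\sqrt{C_1}}{2 \beta C_2 \varepsilon \sqrt{n}} \right)^{\frac{2}{2\beta + 1}} = \kappa \left( n^{-\frac{1}{2}} \right)^{\frac{2}{2\beta + 1}} \propto n^{-\frac{1}{2\beta + 1}}, \qquad (3) $$

where $\kappa = \frac{(C_1)^{\frac{1}{2\beta + 1}}}{(2 \beta C_2 \varepsilon)^{\frac{2}{2\beta + 1}}}$. It can be shown that the second derivative of U is positive, which guarantees that the
zero of the derivative above is the minimizer. After substituting $h = \kappa n^{-\frac{1}{2\beta + 1}}$ in U, we obtain that
$U(\kappa n^{-\frac{1}{2\beta + 1}}) \propto n^{-\frac{\beta}{2\beta + 1}}$.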
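
The bandwidth choice in (3) can be checked numerically. The Python sketch below assumes the bound $U(h) = C_2 h^\beta + \sqrt{C_1 / (n h \varepsilon^2)}$ and the minimizer $h = \left(\sqrt{C_1} / (2 \beta C_2 \varepsilon \sqrt{n})\right)^{2/(2\beta+1)}$; the numerical values of the constants are illustrative assumptions and are not taken from the paper. It compares the closed-form bandwidth with a brute-force grid search and checks the $n^{-1/(2\beta+1)}$ scaling.

```python
import numpy as np

def bound_u(h, n, eps, c1, c2, beta):
    """U(h) = C2 * h^beta + sqrt(C1 / (n * h * eps^2))."""
    return c2 * h ** beta + np.sqrt(c1 / (n * h * eps ** 2))

def optimal_bandwidth(n, eps, c1, c2, beta):
    """Zero of dU/dh: (sqrt(C1) / (2 beta C2 eps sqrt(n)))^(2 / (2 beta + 1))."""
    base = np.sqrt(c1) / (2.0 * beta * c2 * eps * np.sqrt(n))
    return base ** (2.0 / (2.0 * beta + 1.0))

# Illustrative constants (assumptions, not values from the paper).
n, eps, c1, c2, beta = 1000, 0.1, 0.5, 1.0, 2.0
h_star = optimal_bandwidth(n, eps, c1, c2, beta)

# Brute-force minimization of U over a grid around the closed-form value.
grid = np.linspace(0.5 * h_star, 2.0 * h_star, 1001)
h_grid = grid[np.argmin(bound_u(grid, n, eps, c1, c2, beta))]

# With beta = 2 the bandwidth scales as n^(-1/5): 32 times the data halves it.
ratio = optimal_bandwidth(32 * n, eps, c1, c2, beta) / h_star
print(h_star, h_grid, ratio)
```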


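Choosing $\varepsilon > 0$ such that $\ln(\varepsilon) = \frac{2\beta + 1}{2\beta} \theta + \frac{1}{2} \ln(C_1) - \frac{1}{2} \ln(n) + \frac{1}{2\beta} \ln(2 \beta C_2)$ and substituting $h = \kappa n^{-\frac{1}{2\beta + 1}}$, we
obtain by Lemma 1 that

$$ \Pr\left( \frac{1}{n} \sum_{i=1}^{n} |\hat{g}(c_i) - g(c_i)| > \varepsilon + C_2 \kappa^\beta n^{-\frac{\beta}{2\beta + 1}} \right) \le \sqrt{\frac{C_1}{\kappa\, n^{\frac{2\beta}{2\beta + 1}} \varepsilon^2}} = e^{-\theta}. $$

We now move to construct finite sample bounds for the CSD-SVM learning method when g is unknown using
the above lemma. We assume that $\hat{g}$ is the kernel density estimate of g, such that the conditions of Lemma 1
hold.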


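Theorem 2. Assume that (A1)-(A4) hold. Assume the setting of Lemma 1 and that $\inf_{z \in \mathcal{Z},\, c \in [0, \tau]} \hat{g}(c|z) \ge K > 0$,
for some $K > 0$. Choose $\alpha$ such that

$$ 0 < (C_1)^{\frac{1}{2}} (2 \beta C_2)^{\frac{1}{2\beta}} n^{-\frac{1}{2}} < \alpha < 2 (C_1)^{\frac{1}{2}} (2 \beta C_2)^{\frac{1}{2\beta}} n^{-\frac{1}{2}} \quad \text{and} \quad \ln(\alpha) = \frac{2\beta + 1}{2\beta} \theta + \frac{1}{2} \ln(C_1) - \frac{1}{2} \ln(n) + \frac{1}{2\beta} \ln(2 \beta C_2). $$

Then for fixed $\lambda > 0$, $\theta > 0$, $n \ge 1$, $\varepsilon > 0$, we have with probability not less than $1 - 2e^{-\theta}$ that

$$ \lambda \|f_{D,\lambda}\|^2_{\mathcal{H}} + \mathcal{R}_{L,P}(f_{D,\lambda}) - \mathcal{R}^*_{L,P} - A_2(\lambda) < B \sqrt{\frac{2 \log\left(2 N(\lambda^{-1/2} B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)\right) + 2\theta}{n}} + \frac{2 c_\ell \varepsilon}{K} + 4 c_L \varepsilon + 2\eta, $$

where $\eta = \frac{B_1 (\alpha + C_2 h^\beta)}{2 K^2}$.

The proof of the theorem appears in Appendix A.3.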

Using the above theorem we show that under some conditions, the CSD-SVM decision function converges in 
probability to the conditional expectation. 


Corollary 2. Let $\lambda_n$ be a sequence such that $\lambda_n \to 0$ and $\lambda_n n \to \infty$ as $n \to \infty$. Choose $\varepsilon = n^{-p}$ for some
$p > 0$. Assume the setting of Theorem 2; then the CSD-SVM learning method is consistent.

The proof of the corollary is derived similarly to the proof of Corollary 1 (consistency - case I).

4.3. Learning rates. In this section we derive learning rates for cases I and II. 

Definition 4. A learning method is said to learn with rate $(\varepsilon_n) \subset (0, 1]$ that converges to zero if for all
$n \ge 1$ and all $\tau \in (0, 1]$, $\Pr\left( \mathcal{R}_{L,P}(f_D) - \mathcal{R}^*_{L,P} \le c_P c_\tau \varepsilon_n \right) \ge 1 - \tau$, where $c_\tau$ and $c_P$ are constants such that
$c_\tau \in [1, \infty)$ and $c_P > 0$.

Theorem 3. Assume that (A1)-(A4) hold. Choose $0 < \lambda < 1$ and assume that there exist constants $a \ge 1$,
$p > 0$ such that $\log(N(B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon)) \le a \varepsilon^{-2p}$. Additionally, assume that there exist constants $c > 0$, $\gamma \in (0, 1]$
such that $A_2(\lambda) \le c \lambda^\gamma$. Choose $\lambda_n = n^{-\frac{1}{(1+p)(2\gamma+1)}}$. Then

(i) If g is known, the learning rate is given by $n^{-\frac{\gamma}{(1+p)(2\gamma+1)}}$.

(ii) If g is not known and the setup of Theorem 2 holds, then the learning rate is given by $n^{-\min\left( \frac{\gamma}{(1+p)(2\gamma+1)},\, \frac{\beta}{2\beta+1} \right)}$.

The proof of the theorem appears in Appendix A.4.

5. Estimation of the Failure Time Expectation. In this section we demonstrate how to compute
the CSD-SVM decision function with respect to the quadratic loss. In addition, we show that the solution
has a closed form. Since $L^n(D, (Z, C, \Delta, s)) = \frac{(1 - \Delta)\, 2(C - s)}{g(C|Z)} + s^2$ is convex, then for any RKHS $\mathcal{H}$ over $\mathcal{Z}$
and for all $\lambda > 0$, it follows that there exists a unique SVM solution $f_{D,\lambda}$. In addition, by the Representer
Theorem (Steinwart and Christmann, 2008, 5.5), there exist constants $\alpha = (\alpha_1, \ldots, \alpha_n)^T \in \mathbb{R}^n$ such that
$f_{D,\lambda}(z) = \sum_{i=1}^{n} \alpha_i k(z, z_i)$, $z \in \mathcal{Z}$. Hence the optimization problem reduces to estimation of the vector $\alpha$. A
more general approach will also include an intercept term b such that $f_{D,\lambda}(z) = \sum_{i=1}^{n} \alpha_i k(z, z_i) + b$.

Let $\Phi : \mathcal{Z} \to \mathcal{H}$ be the feature map that maps the input data into an RKHS $\mathcal{H}$ such that $\langle \Phi(z_j), \Phi(z) \rangle =
k(z_j, z)$. Our goal is to find a function $f_{D,\lambda}$ that is the solution of (1). From the Representer Theorem, there
exists a unique SVM decision function of the form $f_{D,\lambda}(z) = \sum_{j=1}^{n} \alpha_j \langle \Phi(z_j), \Phi(z) \rangle + b$.

Define for each $\alpha \in \mathbb{R}^n$ the function $w(\alpha)$ by $w(\alpha) = \sum_{j=1}^{n} \alpha_j \Phi(z_j)$.

Then for $C_\lambda = \frac{1}{n\lambda}$, the optimization problem reduces to:

$$ \min_{w, b, r}\ \frac{C_\lambda}{2} \sum_{i=1}^{n} \left[ \frac{(1 - \Delta_i)\, 2 r_i}{g(C_i|Z_i)} + (c_i - r_i)^2 \right] + \frac{1}{2} \|w\|^2 $$

such that $r_i = c_i - f(z_i)$,
where $f(z_i) = \langle w, \Phi(z_i) \rangle + b$.

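This is an optimization problem under equality constraints and hence we will use the method of Lagrange
multipliers. The Lagrangian is given by

$$ Lagrange_P = \frac{C_\lambda}{2} \sum_{i=1}^{n} \left[ \frac{(1 - \Delta_i)\, 2 r_i}{g(C_i|Z_i)} + (c_i - r_i)^2 \right] + \frac{1}{2} \|w\|^2 + \sum_{i=1}^{n} \alpha_i \left( c_i - \langle w, \Phi(z_i) \rangle - b - r_i \right). $$

Minimizing the primal Lagrangian $Lagrange_P$ yields the following conditions for optimality:

$$ w = \sum_{i=1}^{n} \alpha_i \Phi(z_i), \qquad r_i = c_i + \frac{\alpha_i}{C_\lambda} - \frac{(1 - \Delta_i)}{g(C_i|Z_i)}, \qquad \sum_{i=1}^{n} \alpha_i = 0. $$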

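Since these are equality constraints in the dual formulation, we can substitute them into $Lagrange_P$ to obtain
the dual problem $Lagrange_D$. By the strong duality theorem (Bazaraa et al., 2006, Theorem 6.2.4), the solution
of the dual problem is equivalent to the solution of the primal problem. Writing $v_i = \frac{(1 - \Delta_i)}{g(C_i|Z_i)}$, the substitution gives

$$ Lagrange_D = \frac{C_\lambda}{2} \sum_{i=1}^{n} \left[ 2 v_i \left( c_i - v_i + \frac{\alpha_i}{C_\lambda} \right) + \left( v_i - \frac{\alpha_i}{C_\lambda} \right)^2 \right] + \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(z_i, z_j) + \sum_{i=1}^{n} \alpha_i \left( v_i - \frac{\alpha_i}{C_\lambda} - \sum_{j=1}^{n} \alpha_j k(z_i, z_j) \right). $$

Some calculations yield:

$$ Lagrange_D = \sum_{i=1}^{n} \alpha_i v_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j k(z_i, z_j) - \frac{1}{2 C_\lambda} \sum_{i=1}^{n} \alpha_i^2 + C_\lambda \sum_{i=1}^{n} \left( v_i c_i - \frac{v_i^2}{2} \right), $$

where the last sum does not depend on $\alpha$, subject to the constraint $\sum_{i=1}^{n} \alpha_i = 0$, where $v = \left( \frac{(1 - \Delta_1)}{g(C_1|Z_1)}, \ldots, \frac{(1 - \Delta_n)}{g(C_n|Z_n)} \right)^T$.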

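This is a quadratic programming problem subject to equality constraints. Its solution satisfies:

$$
\begin{pmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_n \\ b \end{pmatrix}
=
\begin{pmatrix}
K_{11} + \frac{1}{C_\lambda} & K_{12} & \cdots & K_{1n} & 1 \\
K_{21} & K_{22} + \frac{1}{C_\lambda} & \cdots & K_{2n} & 1 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
K_{n1} & K_{n2} & \cdots & K_{nn} + \frac{1}{C_\lambda} & 1 \\
1 & 1 & \cdots & 1 & 0
\end{pmatrix}^{-1}
\begin{pmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \\ 0 \end{pmatrix},
$$

where $K_{ij} = k(z_i, z_j)$.

Note that if we do not require an intercept term, the solution is $\alpha = \left( K + \frac{1}{C_\lambda} I \right)^{-1} v$. It is interesting
to note that this solution is equivalent to the solution attained by the Representer Theorem for differentiable
loss functions: $\alpha_i = -\frac{1}{2 \lambda n} L'(x_i, y_i, f_{D,\lambda}(x_i))$ (Steinwart and Christmann, 2008, Section 5.2). In our case,
$L^n(C_i, f(Z_i)) = \frac{(1 - \Delta_i)\, 2(C_i - f(Z_i))}{g(C_i|Z_i)} + (f(Z_i))^2$, hence $\alpha_i = -\frac{1}{2 \lambda n} \frac{\partial L^n}{\partial f}(C_i, f(Z_i)) = -\frac{1}{2 \lambda n} \left( \frac{-2(1 - \Delta_i)}{g(C_i|Z_i)} + 2 f(Z_i) \right)$, and
since $f(Z_i) = \sum_{j=1}^{n} \alpha_j k(z_i, z_j)$, we see that $\alpha = C_\lambda v - C_\lambda K \alpha$, i.e., $\alpha = \left( K + \frac{1}{C_\lambda} I \right)^{-1} v$.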
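
The closed form can be sketched in a few lines of Python (the paper's own implementation is in Matlab and is not reproduced here). The sketch assumes the unnormalized quadratic loss of (1), $v_i = (1 - \Delta_i)/g(C_i|Z_i)$, $1/C_\lambda = n\lambda$, and a Gaussian RBF kernel of the form $\exp(-\|x - x'\|^2/(2\sigma^2))$; the simulated data, the parameter values, and the function names `rbf_kernel`, `fit_csd_svm` and `predict` are hypothetical. Note that the monitoring times enter the linear system only through $\Delta_i$ and $g(C_i|Z_i)$.

```python
import numpy as np

def rbf_kernel(a, b, sigma):
    """Gaussian RBF Gram matrix between the rows of a and the rows of b."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_csd_svm(z, delta, g_at_c, lam, sigma):
    """Solve [[K + n*lam*I, 1], [1^T, 0]] [alpha; b] = [v; 0], v_i = (1 - delta_i) / g_i."""
    n = z.shape[0]
    v = (1.0 - delta) / g_at_c
    system = np.zeros((n + 1, n + 1))
    system[:n, :n] = rbf_kernel(z, z, sigma) + n * lam * np.eye(n)
    system[:n, n] = 1.0   # column of ones for the intercept b
    system[n, :n] = 1.0   # constraint sum(alpha) = 0
    solution = np.linalg.solve(system, np.append(v, 0.0))
    return solution[:n], solution[n]

def predict(z_new, z_train, alpha, b, sigma):
    """Evaluate f(z) = sum_j alpha_j k(z, z_j) + b."""
    return rbf_kernel(z_new, z_train, sigma) @ alpha + b

# Hypothetical data: failure times in [0, tau], uniform monitoring times (g = 1 / tau).
rng = np.random.default_rng(3)
n, tau = 200, 1.0
z = rng.uniform(0.0, 1.0, size=(n, 1))
t = np.clip(0.3 + 0.4 * z[:, 0] + 0.1 * rng.normal(size=n), 0.0, tau)
c = rng.uniform(0.0, tau, size=n)
delta = (t <= c).astype(float)

lam, sigma = 0.01, 0.5
alpha, b = fit_csd_svm(z, delta, np.full(n, 1.0 / tau), lam, sigma)
fitted = predict(z, z, alpha, b, sigma)
```

When g is unknown, `g_at_c` would hold the estimated density values instead; the linear system is otherwise unchanged.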


6. Simulation Study. In this section we test the CSD-SVM learning method on simulated data and
compare its performance to the current state of the art. We construct four different data-generating mechanisms,
including one-dimensional and multi-dimensional settings. For each data type, we compute the difference
between the CSD-SVM decision function and the true expectation. We compare this result to results obtained by
the Cox model and by the AFT model. As a reference, we compare all these methods to the Bayes risk.

For each data setting, we considered two cases: (i) the censoring density g is known; and (ii) the censoring
density is not known. For the second case, the distribution of the censoring variable was estimated using
nonparametric kernel density estimation with a normal kernel. The code was written in Matlab, using the
Spider library^1. In order to fit the Cox model to current status data, we downloaded the 'ICsurv' R package
(Wang, 2014). In this package, monotone splines are used to estimate the cumulative baseline hazard function,
and the model parameters are then chosen via the EM algorithm. We chose the most commonly used cubic
splines. To choose the number and locations of the knots, we followed Ramsay (1988) and McMahan et al.
(2013), who both suggested using a fixed small number of knots, and thus we placed the knots evenly at the
quantiles. For the AFT model, we used the 'survreg' function in the 'survival' R package (Therneau and
Lumley, 2014). In order to call R through Matlab, we installed the R package rscproxy (Baier, 2012), installed
the statconnDCOM server^2, and downloaded the Matlab R-Link toolbox (Henson, 2004). For the kernel of the
RKHS $\mathcal{H}$, we used both a linear kernel and a Gaussian RBF kernel $k(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right)$, where $\sigma$ and
$C_\lambda$ were chosen using 5-fold cross-validation. The code for the algorithm and for the simulations is available for
download; a link to the code can be found in the Supplementary Material.

^1 The Spider library for Matlab can be downloaded from http://www.kyb.tuebingen.mpg.de/bs/people/spider/

We consider the following four failure time distributions, corresponding to the four different data-generating 
mechanisms: (1) Weibull, (2) Multi-Weibull, (3) Multi-Log-Normal, and (4) an additional example where the 
failure time expectation is triangle shaped. We present below the CSD-SVM risks for each case and compare 
them to risks obtained by other methods. The risks are based on 100 iterations per sample size. The Bayes risk 
is also plotted as a reference. 

In Setting 1 (Weibull failure time), the covariates Z are generated uniformly on [0,1], the censoring variable C
is generated uniformly on $[0, \tau]$, and the failure time T is generated from a Weibull distribution with parameters
scale = shape = 2. The failure time was then truncated at $\tau = 1$.

Figure 1 compares the results obtained by the CSD-SVM to results achieved by the Cox model and by the
AFT model, for different sample sizes. It should be noted that both the proportional hazards (PH) and the AFT
assumptions hold for the Weibull failure time distribution. In particular, when the PH assumption holds, estimation
based on the Cox regression is consistent and efficient; hence, when the PH assumption holds, we will use the Cox
regression as a benchmark. Figure 1 shows that when g is known, even though the CSD-SVM does not use the PH
assumption or the AFT assumption, the results are comparable to those of the Cox regression, and are better than
the AFT estimates, especially for larger sample sizes. However, when g is not known, the Cox model produces the
smallest risks, but its advantage diminishes as the sample size grows. This coincides with the fact that when g is
not known, the learning rate of the CSD-SVM is slower.

In Setting 2 (Multi-Weibull failure time), the covariates Z are generated uniformly on $[0,1]^d$, and the
censoring variable C is generated uniformly on $[0, \tau]$, as in Setting 1. The failure time T is generated from a Weibull
distribution with parameters scale = $-0.5 Z_1 + 2 Z_2 - Z_3$, shape = 2. The failure time was then truncated
at $\tau = 2$. Note that this model depends only on the first three variables. In Figure 2, boxplots of risks are
presented. Figure 2 illustrates that the CSD-SVM with a linear kernel is superior to the other methods, for all
sample sizes and for both the cases g known and unknown. However, since the data may be sparse in the feature
space, the CSD-SVM with an RBF kernel might require a larger sample size to converge.

^2 Baier, Thomas, and Neuwirth, Erich (2007). Excel :: COM :: R. Computational Statistics, Volume 22, Number 1, April 2007.
Physica Verlag.

[Figure 1: two panels, "Risks for Weibull distribution, case I" and "Risks for Weibull distribution, case II".]

Fig 1. Weibull failure time distribution. The Bayes risk is the dashed black line and the boxplots of the following risks are compared:
CSD-SVM with an RBF kernel, CSD-SVM with a linear kernel, Cox and AFT, for sample sizes n = 50, 100, 200, 400, 800.

In Setting 3 (Multi-Log-Normal), the covariates Z are generated uniformly on $[0,1]^d$, C was generated as
before, and the failure time T was generated from a Log-Normal distribution with parameters
$\mu = \frac{1}{2}(0.3 Z_1 + 0.5 Z_2 + 0.2 Z_3)$, $\sigma = 1$. The failure time was then truncated at $\tau = 7$. Figure 3 presents the risks of the
compared methods. This example illustrates that for small sample sizes, the CSD-SVM risks are significantly
superior and converge quickly to the Bayes risk. As the sample size grows, the AFT also converges to the Bayes
risk whereas the Cox estimates do not, as can be seen by the very high risks they produce. Note that for
the Log-Normal distribution, even though the AFT assumption is correct, the CSD-SVM manages to produce
better or equivalent results.

In Setting 4, we considered a non-smooth conditional expectation function in the shape of a triangle. The 
covariates Z are generated uniformly on [0,1], C is generated uniformly on [0, τ], and T was generated according 
to the following: 

$$T = \begin{cases} 4 + 6Z + \epsilon, & Z < 0.5, \\ 10 - 6Z + \epsilon, & Z \geq 0.5, \end{cases} \qquad \text{where } \epsilon \sim N(0,1).$$

[Figure 2: two panels, "Risks for Multi-Weibull distribution, case I" and "Risks for Multi-Weibull distribution, case II".] 

Fig 2. Multi-Weibull failure time distribution. The Bayes risk is the dashed black line and the boxplots of the following risks 
are compared: the CSD-SVM with an RBF kernel, the CSD-SVM with a linear kernel, Cox and AFT for sample sizes n = 
50, 100, 200, 400, 800. 


The failure time was then truncated at τ = 8. 
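The data-generating mechanism of Setting 4 can be sketched in a few lines of Python. This is only an illustration, not the authors' Matlab code: the function name `simulate_setting4` is hypothetical, truncation is interpreted here as clipping T to [0, τ], and the current status indicator is taken to be Δ = 1{T ≤ C}, which is the convention consistent with the (1 − Δ) weight appearing in the empirical risk.

```python
import numpy as np

def simulate_setting4(n, tau=8.0, seed=0):
    """Draw n current status observations from the triangle-shaped model.

    Assumptions (not stated in this form in the text): truncation means
    clipping T to [0, tau], and delta = 1{T <= C}.
    """
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=n)      # covariate Z ~ U[0, 1]
    c = rng.uniform(0.0, tau, size=n)      # monitoring time C ~ U[0, tau]
    eps = rng.standard_normal(n)           # noise ~ N(0, 1)
    # Triangle-shaped conditional mean: 4 + 6Z for Z < 0.5, 10 - 6Z otherwise.
    mean = np.where(z < 0.5, 4.0 + 6.0 * z, 10.0 - 6.0 * z)
    t = np.clip(mean + eps, 0.0, tau)      # failure time, truncated at tau
    delta = (t <= c).astype(int)           # only (Z, C, delta) is observed
    return z, c, delta, t

z, c, delta, t = simulate_setting4(1000)
print("fraction with failure before monitoring:", delta.mean())
```

In a current status analysis only (z, c, delta) would be passed to the estimator; t is returned here solely so that a simulated risk can be evaluated against the truth.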

In Figure 4, the boxplots of risks are presented. As can be seen, the CSD-SVM with an RBF kernel is superior 
in both cases, for sufficiently large sample sizes. 

To illustrate the flexibility of the CSD-SVM, we also present a graphical representation of the true conditional 
expectation and its estimates, as a function of the covariates. Figure 5 compares the true expectation to the 
computed estimates for the case that g is known; these estimates are based on the first iteration. As can be 
seen, the CSD-SVM with an RBF kernel produces the best results. 

To summarize, Figures 1-5 show that the CSD-SVM is comparable to other known methods for estimating 
the failure time distribution with current status data, and in certain cases is even better. Specifically, we found 
that the CSD-SVM with an appropriate kernel was superior in three out of the four examples, especially when 
the true density g is known. It should be noted that even when the assumptions of the other models were true, 
the CSD-SVM estimates were comparable. Additionally, when these assumptions fail to hold, the CSD-SVM 
estimates were generally better. The main advantage of the proposed SVM approach is that it does not assume 
any parametric form and thus may be superior, especially when the assumptions of other models fail to hold. 
Finally, it seems that the CSD-SVM can perform well in higher dimensions. 

[Figure 3: two panels, "Risks for Multi-LogNormal distribution, case I" and "Risks for Multi-LogNormal distribution, case II".] 

Fig 3. Multi-LogNormal failure time distribution. The Bayes risk is the dashed black line and the boxplots of the following 
risks are compared: the CSD-SVM with an RBF kernel, the CSD-SVM with a linear kernel, Cox and AFT for sample sizes 
n = 50, 100, 200, 400, 800. 


7. Concluding Remarks. We proposed an SVM approach for estimation of the failure time expectation, 
studied its theoretical properties and presented a simulation study. We believe this work demonstrates an 
important approach in applying machine learning techniques to current status data. However, many open 
questions remain and many possible generalizations exist. First, note that we only studied the problem of 
estimating the failure time expectation and not other distribution related quantities. Further work needs to 
be done in order to extend the SVM approach to other estimation problems with current status data. Second, 
we assumed that the censoring is independent of the failure time given the covariates and that the censoring 
density is positive given the covariates over the entire observed time range. It would be worthwhile to study the 
consequences of violation of some of these assumptions. Third, it could be interesting to extend this work to 
other censored data formats such as interval censoring. We believe that further development and generalization 
of SVM learning methods for different types of censored data is of great interest. 

SUPPLEMENTARY MATERIAL 
Supplementary Material: Matlab Code 

The code can be downloaded from http://stat.haifa.ac.il/~ygoldberg/research. Please read the file 
README.pdf for details on the files in this folder. 

[Figure 4: two panels, "Risks for Triangle-shaped expectation, case I" and "Risks for Triangle-shaped expectation, case II".] 

Fig 4. Triangle shaped failure time expectation. The Bayes risk is the dashed black line and the boxplots of the following 
risks are compared: the CSD-SVM with an RBF kernel, the CSD-SVM with a linear kernel, Cox and AFT for sample sizes 
n = 50, 100, 200, 400, 800. 


APPENDIX A: PROOFS 

A.1. Proof of Theorem 1. 

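Proof. Since $L^n(D,(Z,C,\Delta,s)) = \frac{(1-\Delta)\,\ell(C,s)}{g(C|Z)} + L(0,s)$ is convex, it implies that there exists a unique 
SVM solution (see Steinwart and Christmann, 2008, Section 5.1). For all distributions $Q$ on $\mathcal{Z}\times\mathcal{Y}$, we define 
the SVM decision function by 

$$f_{Q,\lambda} = \arg\inf_{f\in\mathcal{H}} \; \lambda\|f\|_{\mathcal{H}}^2 + R_{L,Q}(f).$$

We note that for an RKHS $\mathcal{H}$ of a continuous kernel $k$ with $\|k\|_\infty \le 1$, 

$$\|f_{Q,\lambda}\|_\infty \le \|k\|_\infty \|f_{Q,\lambda}\|_{\mathcal{H}} \le \|f_{Q,\lambda}\|_{\mathcal{H}}.$$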
[Figure 5: four panels, "True expectation vs estimated values", for sample sizes 50, 100, 400 and 800.] 

Fig 5. Triangle shaped failure time expectation, case I (g is known). The true expectation is the blue line. The following estimates are 
compared: the CSD-SVM with an RBF kernel, the CSD-SVM with a linear kernel, Cox and AFT for sample sizes n = 50, 100, 400, 800. 


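Hence $\|f_{Q,\lambda}\|_\infty \le \|f_{Q,\lambda}\|_{\mathcal{H}} \le \sqrt{R_{L,Q}(0)/\lambda}$, since $f = 0$ is admissible in the infimum that defines $f_{Q,\lambda}$. 
By Remark 1, $L(y,0) \le 1$ for all $y \in \mathcal{Y}$ and so we conclude that $R_{L,Q}(0) \le 1$ and thus 
$\|f_{Q,\lambda}\|_\infty \le \|f_{Q,\lambda}\|_{\mathcal{H}} \le \lambda^{-1/2}$ for all distributions $Q$ on $\mathcal{Z}\times\mathcal{Y}$. 

Recall that the unit ball of $\mathcal{H}$ is denoted by $B_{\mathcal{H}}$ and its closure by $\overline{B}_{\mathcal{H}}$; since $\|f_{P,\lambda}\|_{\mathcal{H}} \le \lambda^{-1/2}$ we can write 
$f_{P,\lambda} \in \lambda^{-1/2} B_{\mathcal{H}}$. Since $\mathcal{Z}$ is compact, it implies that the $\|\cdot\|_\infty$-closure $\overline{B}_{\mathcal{H}}$ of the unit ball $B_{\mathcal{H}}$ is compact in 
$\ell_\infty(\mathcal{Z})$ (see Steinwart and Christmann, 2008, Corollary 4.31). 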

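Since $f_{D,\lambda}$ minimizes $\lambda\|f\|_{\mathcal{H}}^2 + R_{L,D}(f)$, 

$$\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,D}(f_{D,\lambda}) \le \lambda\|f_{P,\lambda}\|_{\mathcal{H}}^2 + R_{L,D}(f_{P,\lambda}).$$

Recall that the approximation error is defined by $A_2(\lambda) = \inf_{f\in\mathcal{H}} \lambda\|f\|_{\mathcal{H}}^2 + R_{L,P}(f) - R^*_{L,P}$ and thus, as in Steinwart 
and Christmann (2008, Eq. 6.18), 

$$\begin{aligned}
&\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \\
&= \lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - \lambda\|f_{P,\lambda}\|_{\mathcal{H}}^2 - R_{L,P}(f_{P,\lambda}) \\
&= \lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,D}(f_{D,\lambda}) - R_{L,D}(f_{D,\lambda}) + R_{L,P}(f_{D,\lambda}) - \lambda\|f_{P,\lambda}\|_{\mathcal{H}}^2 - R_{L,P}(f_{P,\lambda}) \\
&\le \lambda\|f_{P,\lambda}\|_{\mathcal{H}}^2 + R_{L,D}(f_{P,\lambda}) - R_{L,D}(f_{D,\lambda}) + R_{L,P}(f_{D,\lambda}) - \lambda\|f_{P,\lambda}\|_{\mathcal{H}}^2 - R_{L,P}(f_{P,\lambda}) \\
&= R_{L,D}(f_{P,\lambda}) - R_{L,D}(f_{D,\lambda}) + R_{L,P}(f_{D,\lambda}) - R_{L,P}(f_{P,\lambda}) \\
&\le 2 \sup_{\|f\|_{\mathcal{H}} \le \lambda^{-1/2}} |R_{L,P}(f) - R_{L,D}(f)|.
\end{aligned}$$

That is, 

$$\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \le 2 \sup_{\|f\|_{\mathcal{H}} \le \lambda^{-1/2}} |R_{L,P}(f) - R_{L,D}(f)|. \tag{4}$$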


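Note that since $L$ is Lipschitz continuous, $|L(y,s) - L(y,s')| \le c_L|s - s'|$ for all $s, s' \in \mathcal{S}$. 
From the discussion above, we are only interested in bounded functions $f \in \lambda^{-1/2}B_{\mathcal{H}}$. 
Then for all $f \in \lambda^{-1/2}B_{\mathcal{H}}$ we have 

$$|L(y,f(z))| \le |L(y,f(z)) - L(y,0)| + L(y,0) \le c_L|f(z)| + 1 \le c_L\lambda^{-1/2} + 1 = B_2;$$

thus we obtain that for functions $f \in \lambda^{-1/2}B_{\mathcal{H}}$, the loss $L(y,f(z))$ is bounded. 

For any $\varepsilon > 0$, let $\mathcal{F}_\varepsilon$ be an $\varepsilon$-net of $\lambda^{-1/2}B_{\mathcal{H}}$. Since $\overline{B}_{\mathcal{H}}$ is compact, the cardinality of the $\varepsilon$-net is 

$$|\mathcal{F}_\varepsilon| = N(\lambda^{-1/2}B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon) = N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon) < \infty.$$

Thus for every $f \in \lambda^{-1/2}B_{\mathcal{H}}$, there exists a function $h \in \mathcal{F}_\varepsilon$ with $\|f - h\|_\infty \le \varepsilon$, and thus 

$$|R_{L,P}(f) - R_{L,D}(f)| \le |R_{L,P}(f) - R_{L,P}(h)| + |R_{L,P}(h) - R_{L,D}(h)| + |R_{L,D}(h) - R_{L,D}(f)| \equiv A_n + B_n + C_n. \tag{5}$$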

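First we will bound $C_n$: 

$$\begin{aligned}
C_n &= |R_{L,D}(h) - R_{L,D}(f)| \\
&= \left| \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,h(Z_i))}{g(C_i|Z_i)} + \frac{1}{n}\sum_{i=1}^n L(0,h(Z_i)) - \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,f(Z_i))}{g(C_i|Z_i)} - \frac{1}{n}\sum_{i=1}^n L(0,f(Z_i)) \right| \\
&\le C_{n,1} + C_{n,2},
\end{aligned}$$

where 

$$\begin{aligned}
C_{n,1} &= \left| \frac{1}{n}\sum_{i=1}^n \left[ \frac{(1-\Delta_i)\ell(C_i,h(Z_i))}{g(C_i|Z_i)} - \frac{(1-\Delta_i)\ell(C_i,f(Z_i))}{g(C_i|Z_i)} \right] \right| \\
&\le \frac{1}{n}\sum_{i=1}^n \frac{1}{g(C_i|Z_i)} \left| \ell(C_i,h(Z_i)) - \ell(C_i,f(Z_i)) \right| \\
&\le \frac{1}{2nK}\sum_{i=1}^n \left| \ell(C_i,h(Z_i)) - \ell(C_i,f(Z_i)) \right| \\
&\le \frac{1}{2nK}\sum_{i=1}^n c_\ell |h(Z_i) - f(Z_i)| \le \frac{1}{2nK}\sum_{i=1}^n c_\ell\varepsilon = \frac{c_\ell\varepsilon}{2K},
\end{aligned}$$

and where 

$$\begin{aligned}
C_{n,2} &= \left| \frac{1}{n}\sum_{i=1}^n \left[ L(0,h(Z_i)) - L(0,f(Z_i)) \right] \right| \\
&\le \frac{1}{n}\sum_{i=1}^n |L(0,h(Z_i)) - L(0,f(Z_i))| \\
&\le \frac{1}{n}\sum_{i=1}^n c_L |h(Z_i) - f(Z_i)| \le \frac{1}{n}\sum_{i=1}^n c_L\varepsilon = c_L\varepsilon.
\end{aligned}$$

So we were able to bound $C_{n,2}$ by $c_L\varepsilon$, and hence $C_n \le \frac{c_\ell\varepsilon}{2K} + c_L\varepsilon$. 

Similarly, using the property that $E[a] = a$ for any constant $a$, it can be shown that $A_n \le \frac{c_\ell\varepsilon}{2K} + c_L\varepsilon$. 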
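As an interim summary, we showed that 

$$\sup_{f \in \lambda^{-1/2}B_{\mathcal{H}}} |R_{L,P}(f) - R_{L,D}(f)| \le \sup_{h\in\mathcal{F}_\varepsilon} \underbrace{|R_{L,P}(h) - R_{L,D}(h)|}_{=B_n} + \frac{1}{K}c_\ell\varepsilon + 2c_L\varepsilon. \tag{6}$$

Recall that the loss $L(y,f(z))$ is bounded by $B_2$ and that by Remark 1, $\ell(y,s) \le B_1$. 
We note that 

$$\frac{(1-\Delta)\ell(C,h(Z))}{g(C|Z)} + L(0,h(Z)) \le \frac{\ell(C,h(Z))}{g(C|Z)} + L(0,h(Z)) \le \frac{B_1}{2K} + B_2 = B.$$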


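Combining this with equation (4), we obtain that 

$$\begin{aligned}
&\Pr\left( \lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \ge B\sqrt{\frac{2\rho}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \right) \\
&\le \Pr\left( 2\sup_{\|f\|_{\mathcal{H}}\le\lambda^{-1/2}} |R_{L,P}(f) - R_{L,D}(f)| \ge B\sqrt{\frac{2\rho}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \right) && \text{(by eq. 4)} \\
&\le \Pr\left( 2\left( \sup_{h\in\mathcal{F}_\varepsilon} |R_{L,P}(h) - R_{L,D}(h)| + \frac{1}{K}c_\ell\varepsilon + 2c_L\varepsilon \right) \ge B\sqrt{\frac{2\rho}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \right) && \text{(by eq. 6)} \\
&= \Pr\left( 2\left( \sup_{h\in\mathcal{F}_\varepsilon} B_n + \frac{1}{K}c_\ell\varepsilon + 2c_L\varepsilon \right) \ge B\sqrt{\frac{2\rho}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \right) \\
&= \Pr\left( \sup_{h\in\mathcal{F}_\varepsilon} B_n \ge B\sqrt{\frac{\rho}{2n}} \right) \\
&= \Pr\left( \sup_{h\in\mathcal{F}_\varepsilon} |R_{L,P}(h) - R_{L,D}(h)| \ge B\sqrt{\frac{\rho}{2n}} \right).
\end{aligned}$$

By the union bound, the last expression is bounded by 

$$\sum_{h\in\mathcal{F}_\varepsilon} \Pr\left( |R_{L,P}(h) - R_{L,D}(h)| \ge B\sqrt{\frac{\rho}{2n}} \right),$$

which can then be bounded again by $2|\mathcal{F}_\varepsilon|e^{-\rho}$ using Hoeffding's inequality (Steinwart and Christmann, 2008, 
Theorem 6.10), where $\mathcal{F}_\varepsilon$ is an $\varepsilon$-net of $\lambda^{-1/2}B_{\mathcal{H}}$ with cardinality 

$$|\mathcal{F}_\varepsilon| = N(\lambda^{-1/2}B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon) < \infty.$$

Define $\rho = \log(2|\mathcal{F}_\varepsilon|) + \theta$, then 

$$\Pr\left( \lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \ge B\sqrt{\frac{2\rho}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \right) \le e^{-\theta},$$

which concludes the proof. 

□ 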
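To get a feel for the finite-sample bound just derived, the right-hand side B·sqrt(2ρ/n) + 2c_ℓε/K + 4c_Lε with ρ = log(2|F_ε|) + θ can be evaluated numerically. The sketch below is illustrative only: the function name `theorem1_bound` and all numeric inputs are hypothetical, and the logarithm of the net cardinality is passed in directly rather than computed from a kernel.

```python
import math

def theorem1_bound(n, theta, log_net_size, B, c_ell, c_L, K, eps):
    """Evaluate B*sqrt(2*rho/n) + 2*c_ell*eps/K + 4*c_L*eps.

    rho = log(2 * |F_eps|) + theta, where log_net_size = log |F_eps|.
    All arguments are illustrative inputs, not values from the paper.
    """
    rho = math.log(2.0) + log_net_size + theta
    return B * math.sqrt(2.0 * rho / n) + 2.0 * c_ell * eps / K + 4.0 * c_L * eps

# Hypothetical constants: the stochastic term shrinks like n^(-1/2),
# while the discretisation terms depend only on eps.
for n in (50, 100, 200, 400, 800):
    value = theorem1_bound(n, theta=1.0, log_net_size=5.0,
                           B=2.0, c_ell=1.0, c_L=1.0, K=0.5, eps=0.01)
    print(n, round(value, 4))
```

The trade-off visible here (a smaller ε reduces the last two terms but enlarges the net, and hence ρ) is what the later choice of ε in the proof of Theorem 3 balances.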


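A.2. Proof of Lemma 1. 

Proof. Note that 

$$\frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - g(c_i)| \le \frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - E[\hat{g}(c_i)]| + \frac{1}{n}\sum_{i=1}^n |E[\hat{g}(c_i)] - g(c_i)| \equiv A + B.$$

As in Tsybakov (2008, Proposition 1.1), define $\eta_i(c) = K\left(\frac{C_i - c}{h}\right) - E_g\left[K\left(\frac{C_i - c}{h}\right)\right]$. Then $\eta_i(c)$, for $i = 1,\ldots,n$, 
are i.i.d. random variables with zero mean and with variance: 

$$\begin{aligned}
\mathrm{Var}[\eta_i(c)] &= E_g\left[\eta_i^2(c)\right] = E_g\left[K^2\left(\frac{C_i - c}{h}\right)\right] - \left(E_g\left[K\left(\frac{C_i - c}{h}\right)\right]\right)^2 \\
&\le \int K^2\left(\frac{u - c}{h}\right) g(u)\,du \le g_{\max}\int K^2\left(\frac{u - c}{h}\right) du = g_{\max}\, h \int K^2(v)\,dv = C_1 h,
\end{aligned}$$

where the equality before last follows from change of variables and where $C_1 = g_{\max}\int K^2(v)\,dv$. Thus 

$$\mathrm{Var}(\hat{g}(c)) = E_g\left[\left(\frac{1}{nh}\sum_{i=1}^n \eta_i(c)\right)^2\right] = \frac{1}{nh^2}E_g\left[\eta_1^2(c)\right] \le \frac{C_1}{nh}.$$

By the Cauchy-Schwarz inequality we have that $E[|\hat{g}(c) - E[\hat{g}(c)]|] \le \sqrt{E|\hat{g}(c) - E[\hat{g}(c)]|^2} = \sqrt{\mathrm{Var}(\hat{g}(c))}$. 
Hence $E[|\hat{g}(c) - E[\hat{g}(c)]|] \le \sqrt{\frac{C_1}{nh}}$. Therefore, by Markov's inequality, 

$$\Pr(A > \varepsilon) = \Pr\left(\frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - E[\hat{g}(c_i)]| > \varepsilon\right) \le \frac{C_1}{nh\varepsilon^2}.$$

For the second term, as in Tsybakov (2008, Proposition 1.2), we have that 

$$\frac{1}{n}\sum_{i=1}^n |E[\hat{g}(c_i)] - g(c_i)| \le C_2 h^{\beta},$$

where $C_2 = \int |K(v)|\,|v|^{\beta}\,dv < \infty$, and for some $\tau \in [0,1]$. 

In conclusion, we showed that 

$$\begin{aligned}
&\Pr\left(\frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - g(c_i)| > \varepsilon + C_2 h^{\beta}\right) \\
&\le \Pr\left(\frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - E[\hat{g}(c_i)]| + \frac{1}{n}\sum_{i=1}^n |E[\hat{g}(c_i)] - g(c_i)| > \varepsilon + C_2 h^{\beta}\right) \\
&\le \Pr\left(\frac{1}{n}\sum_{i=1}^n |\hat{g}(c_i) - E[\hat{g}(c_i)]| + C_2 h^{\beta} > \varepsilon + C_2 h^{\beta}\right) \le \frac{C_1}{nh\varepsilon^2},
\end{aligned}$$

where $h$ is the bandwidth. 

□ 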
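The quantity controlled by Lemma 1, the average absolute error of a kernel density estimate evaluated at the observed monitoring times, is easy to examine empirically. The following is a minimal sketch under stated assumptions: a Gaussian kernel, a standard normal density standing in for the censoring density g, smoothness β = 2, and a bandwidth h = n^(-1/(2β+1)) with the proportionality constant set to 1. None of these choices are taken from the paper; the function names are hypothetical.

```python
import numpy as np

def gaussian_kde_at_points(points, sample, h):
    """Kernel density estimate (1/(n*h)) * sum_i K((C_i - c)/h), Gaussian K."""
    u = (points[:, None] - sample[None, :]) / h
    kernel_vals = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return kernel_vals.mean(axis=1) / h

def mean_abs_kde_error(n, beta=2.0, seed=0):
    """Average |g_hat(c_i) - g(c_i)| over the sample, for a N(0,1) density."""
    rng = np.random.default_rng(seed)
    c = rng.standard_normal(n)                 # stand-in monitoring times
    h = n ** (-1.0 / (2.0 * beta + 1.0))       # assumed bandwidth rate
    g_hat = gaussian_kde_at_points(c, c, h)
    g_true = np.exp(-0.5 * c**2) / np.sqrt(2.0 * np.pi)
    return np.mean(np.abs(g_hat - g_true))

for n in (50, 200, 800, 2000):
    print(n, mean_abs_kde_error(n))
```

With this bandwidth the variance term C_1/(nh) and the squared bias term (C_2 h^β)^2 are of the same order, which is the balance exploited when Lemma 1 is applied in the proof of Theorem 2.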


A.3. Proof of Theorem 2. 

Proof. Note that the proof of this theorem is similar to the proof of Theorem 1 and thus we will only 
discuss the parts of the proof where they differ. As in Theorem 1, equation (5), 

$$\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \le 2(A_n + B_n + C_n),$$

where 

$$A_n = |R_{L,P}(f) - R_{L,P}(v)|, \quad B_n = |R_{L,P}(v) - R_{L,D}(v)|, \quad \text{and where } C_n = |R_{L,D}(v) - R_{L,D}(f)|.$$

Since $A_n$ does not depend on the data set $D$, the same bound holds as in the proof of Theorem 1, that is, 

$$A_n \le \frac{c_\ell\varepsilon}{2K} + c_L\varepsilon.$$

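We bound $C_n$ as follows: 

$$\begin{aligned}
C_n &= |R_{L,D}(v) - R_{L,D}(f)| \\
&= \left| \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,v(Z_i))}{\hat{g}(C_i|Z_i)} + \frac{1}{n}\sum_{i=1}^n L(0,v(Z_i)) - \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,f(Z_i))}{\hat{g}(C_i|Z_i)} - \frac{1}{n}\sum_{i=1}^n L(0,f(Z_i)) \right| \\
&\le \left| \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,v(Z_i))}{\hat{g}(C_i|Z_i)} - \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,f(Z_i))}{\hat{g}(C_i|Z_i)} \right| + \left| \frac{1}{n}\sum_{i=1}^n L(0,v(Z_i)) - \frac{1}{n}\sum_{i=1}^n L(0,f(Z_i)) \right| \\
&= C_{n,1} + C_{n,2}.
\end{aligned}$$

Using the same arguments as in Theorem 1, we can bound $C_n$ by $\frac{c_\ell\varepsilon}{K} + c_L\varepsilon$. Note that the only difference is 
in the denominator of $C_{n,1}$, since $\frac{1}{g(C_i|Z_i)} \le \frac{1}{2K}$ and $\frac{1}{\hat{g}(C_i|Z_i)} \le \frac{1}{K}$. 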

Recall that the loss $L(y,f(z))$ is bounded by $B_2$. Define $R_{L,D,g}(v)$ by 

$$R_{L,D,g}(v) = \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,v(Z_i))}{g(C_i|Z_i)} + \frac{1}{n}\sum_{i=1}^n L(0,v(Z_i)).$$

In other words, $R_{L,D,g}(v)$ is the empirical risk with the true censoring density function $g$. 
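The empirical risk with the true censoring density can be computed directly from current status data, and a small Monte Carlo experiment shows why it is a sensible surrogate for the unobservable risk based on T. The sketch below is a hypothetical illustration, not the authors' implementation: it assumes the quadratic loss L(y, s) = (y − s)², with ℓ(y, s) = 2(y − s) its derivative in y, a uniform censoring density g = 1/τ, and Δ = 1{T ≤ C}.

```python
import numpy as np

def quadratic_loss(y, s):
    return (y - s) ** 2

def quadratic_loss_dy(y, s):
    # Derivative of the quadratic loss with respect to its first argument.
    return 2.0 * (y - s)

def data_dependent_risk(pred, c, delta, g_vals,
                        loss=quadratic_loss, loss_dy=quadratic_loss_dy):
    """(1/n) * sum[(1 - delta_i) * l(C_i, f(Z_i)) / g(C_i|Z_i) + L(0, f(Z_i))]."""
    return np.mean((1.0 - delta) * loss_dy(c, pred) / g_vals + loss(0.0, pred))

# Monte Carlo check with a constant prediction s (all values are assumptions).
rng = np.random.default_rng(0)
n, tau, s = 200_000, 2.0, 1.0
t = rng.uniform(0.0, tau, n)          # latent failure times, never observed
c = rng.uniform(0.0, tau, n)          # monitoring times with density 1/tau
delta = (t <= c).astype(float)        # current status indicator
g_vals = np.full(n, 1.0 / tau)
pred = np.full(n, s)

risk_cs = data_dependent_risk(pred, c, delta, g_vals)
risk_full = np.mean(quadratic_loss(t, pred))
print("current status risk:", risk_cs, " full-data risk:", risk_full)
```

The two numbers agree up to simulation error because, for independent C with density g on [0, τ], integrating (1 − Δ)ℓ(C, s)/g(C) over C recovers L(T, s) − L(0, s).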
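We bound $B_n$ as follows: 

$$B_n = |R_{L,P}(v) - R_{L,D}(v)| \le |R_{L,P}(v) - R_{L,D,g}(v)| + |R_{L,D,g}(v) - R_{L,D}(v)| \equiv B_{n,1} + B_{n,2},$$

where the summands of $R_{L,D,g}(v)$ satisfy 

$$\frac{(1-\Delta)\ell(C,v(Z))}{g(C|Z)} + L(0,v(Z)) \le \frac{\ell(C,v(Z))}{g(C|Z)} + L(0,v(Z)) \le \frac{B_1}{2K} + B_2 = B,$$

and where 

$$\begin{aligned}
B_{n,2} &= |R_{L,D,g}(v) - R_{L,D}(v)| \\
&= \left| \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,v(Z_i))}{g(C_i|Z_i)} - \frac{1}{n}\sum_{i=1}^n \frac{(1-\Delta_i)\ell(C_i,v(Z_i))}{\hat{g}(C_i|Z_i)} \right| \\
&\le \frac{1}{n}\sum_{i=1}^n |\ell(C_i,v(Z_i))| \left| \frac{1}{g(C_i|Z_i)} - \frac{1}{\hat{g}(C_i|Z_i)} \right| \\
&\le \frac{B_1}{n}\sum_{i=1}^n \left| \frac{\hat{g}(C_i|Z_i) - g(C_i|Z_i)}{g(C_i|Z_i)\,\hat{g}(C_i|Z_i)} \right| \\
&\le \frac{B_1}{2K^2}\cdot\frac{1}{n}\sum_{i=1}^n |\hat{g}(C_i|Z_i) - g(C_i|Z_i)|.
\end{aligned}$$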

Note that these inequalities hold for all functions u G C A We would like to bound the last 

1 

expression using Lemma 1. By equation 3, let h = nn 2^+1 ^ choose a such that 

0 < (( 71)2 (2/3(72)^ n “2 < a < 2 (Ci )2 (2/3(72)^ n “2 and ln{a) = - ^ln{n) + ^/n(2/3C2), 

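and let $\eta = \frac{B_1\left(a + C_2 h^{\beta}\right)}{2K^2}$; then by Lemma 1 

$$\begin{aligned}
\Pr(B_{n,2} > \eta) &\le \Pr\left( \frac{B_1}{2K^2}\cdot\frac{1}{n}\sum_{i=1}^n |\hat{g}(C_i|Z_i) - g(C_i|Z_i)| > \eta \right) \\
&= \Pr\left( \frac{B_1}{2K^2}\cdot\frac{1}{n}\sum_{i=1}^n |\hat{g}(C_i|Z_i) - g(C_i|Z_i)| > \frac{B_1\left(a + C_2 h^{\beta}\right)}{2K^2} \right) \\
&= \Pr\left( \frac{1}{n}\sum_{i=1}^n |\hat{g}(C_i|Z_i) - g(C_i|Z_i)| > a + C_2 h^{\beta} \right) \le \frac{C_1}{nha^2} = e^{-\theta}.
\end{aligned}$$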

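We need to bound the term $B_{n,1} = |R_{L,P}(v) - R_{L,D,g}(v)|$. By the union bound, for all $\rho > 0$, 

$$\begin{aligned}
\Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} B_{n,1}(v) > B\sqrt{\frac{\rho}{2n}} \right) &= \Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} |R_{L,P}(v) - R_{L,D,g}(v)| > B\sqrt{\frac{\rho}{2n}} \right) \\
&\le \sum_{v\in\mathcal{F}_\varepsilon} \Pr\left( |R_{L,P}(v) - R_{L,D,g}(v)| > B\sqrt{\frac{\rho}{2n}} \right).
\end{aligned}$$

We showed that $\frac{(1-\Delta)\ell(C,v(Z))}{g(C|Z)} + L(0,v(Z)) \le B$. Hence by Hoeffding's inequality, the last term can then be 
bounded again by $2|\mathcal{F}_\varepsilon|e^{-\rho}$, where $\mathcal{F}_\varepsilon$ is an $\varepsilon$-net of $\lambda^{-1/2}B_{\mathcal{H}}$ with cardinality 

$$|\mathcal{F}_\varepsilon| = N(\lambda^{-1/2}B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon) < \infty.$$

Define $\rho = \log(2|\mathcal{F}_\varepsilon|) + \theta$, then 

$$\Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} B_{n,1}(v) > B\sqrt{\frac{\rho}{2n}} \right) \le e^{-\theta}.$$

In conclusion we have that 

$$\begin{aligned}
&\Pr\left( \lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) > B\sqrt{\frac{2\rho}{n}} + \frac{3c_\ell\varepsilon}{K} + 4c_L\varepsilon + 2\eta \right) \\
&\le \Pr\left( 2\sup_{\|f\|_{\mathcal{H}}\le\lambda^{-1/2}} |R_{L,P}(f) - R_{L,D}(f)| > B\sqrt{\frac{2\rho}{n}} + \frac{3c_\ell\varepsilon}{K} + 4c_L\varepsilon + 2\eta \right) \\
&\le \Pr\left( 2\left( \sup_{v\in\mathcal{F}_\varepsilon} |R_{L,P}(v) - R_{L,D}(v)| + \frac{3c_\ell\varepsilon}{2K} + 2c_L\varepsilon \right) > B\sqrt{\frac{2\rho}{n}} + \frac{3c_\ell\varepsilon}{K} + 4c_L\varepsilon + 2\eta \right) \\
&\le \Pr\left( 2\sup_{v\in\mathcal{F}_\varepsilon} \left( B_{n,1}(v) + B_{n,2}(v) \right) > B\sqrt{\frac{2\rho}{n}} + 2\eta \right) \\
&\le \Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} \left( B_{n,1}(v) + B_{n,2}(v) \right) > B\sqrt{\frac{\rho}{2n}} + \eta \right) \\
&\le \Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} B_{n,1}(v) > B\sqrt{\frac{\rho}{2n}} \right) + \Pr\left( \sup_{v\in\mathcal{F}_\varepsilon} B_{n,2}(v) > \eta \right) \\
&\le e^{-\theta} + e^{-\theta} = 2e^{-\theta},
\end{aligned}$$

and the result follows. 

□ 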


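A.4. Proof of Theorem 3. 

Proof. Case I 
By Theorem 1, 

$$\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \le B\sqrt{\frac{2\log\left(2N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon)\right) + 2\theta}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon$$

with probability not less than $1 - e^{-\theta}$. For any compact set $\mathcal{S} = [-S, S] \subset \mathbb{R}$, both $L$ and $\ell$ are bounded and 
Lipschitz continuous with Lipschitz constants $c_L \le \frac{2(S+\tau)}{\tau^2}$ and $c_\ell = \frac{2}{\tau^2}$. Hence, 

$$\begin{aligned}
&\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \\
&\le B\sqrt{\frac{2\log\left(2N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon)\right) + 2\theta}{n}} + \frac{2c_\ell\varepsilon}{K} + 4c_L\varepsilon \\
&\le B\sqrt{\frac{2\log\left(2N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon)\right) + 2\theta}{n}} + \frac{4\varepsilon}{K\tau^2} + \frac{8(S+\tau)\varepsilon}{\tau^2} \\
&= B\sqrt{\frac{2\log\left(2N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon)\right) + 2\theta}{n}} + M\varepsilon,
\end{aligned} \tag{7}$$

where $M = \frac{4}{\tau^2}\left(\frac{1}{K} + 2(S+\tau)\right)$. 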

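By the assumption, $\log N(B_{\mathcal{H}}, \|\cdot\|_\infty, \varepsilon) \le a\varepsilon^{-2p}$. Hence: 

$$\log\left(2N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon)\right) = \log(2) + \log N(B_{\mathcal{H}}, \|\cdot\|_\infty, \sqrt{\lambda}\,\varepsilon) \le \log(2) + a\left(\sqrt{\lambda}\,\varepsilon\right)^{-2p} \le 2a\left(\sqrt{\lambda}\,\varepsilon\right)^{-2p}.$$

Choose $\varepsilon = \frac{1}{\sqrt{\lambda}}\left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}}$. Then 

$$a\left(\sqrt{\lambda}\,\varepsilon\right)^{-2p} = a\left( \left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} \right)^{-2p}. \tag{8}$$

By (7) and (8), 

$$\begin{aligned}
&\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} - A_2(\lambda) \\
&\le B\sqrt{\frac{4a\left( \left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} \right)^{-2p} + 2\theta}{n}} + \frac{M}{\sqrt{\lambda}}\left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} \\
&\le B\left( \sqrt{\frac{4a\left( \left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} \right)^{-2p}}{n}} + \sqrt{\frac{2\theta}{n}} \right) + \frac{M}{\sqrt{\lambda}}\left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} \\
&= B\left( \sqrt{2}\left(\frac{p}{2}\right)^{\frac{-p}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right) + \frac{M}{\sqrt{\lambda}}\left(\frac{p}{2}\right)^{\frac{1}{1+p}}\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}}.
\end{aligned} \tag{9}$$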

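Recall that $B_2 = c_L\lambda^{-1/2} + 1$ and $B = \frac{B_1}{2K} + B_2$, where $B_1$ is some bound on the derivative of the loss. Since 
$0 < \lambda < 1$, then $1 < \lambda^{-1/2}$ and therefore $B_2 \le c_L\lambda^{-1/2} + \lambda^{-1/2} = \lambda^{-1/2}(c_L + 1) \le \lambda^{-1/2}\left(\frac{2(S+\tau)}{\tau^2} + 1\right)$. Earlier we 
defined $M$ such that $K = \frac{4}{M\tau^2 - 8(S+\tau)}$. Thus, 

$$\begin{aligned}
B &\le \frac{B_1}{2K} + \frac{1}{\sqrt{\lambda}}\left(\frac{2(S+\tau) + \tau^2}{\tau^2}\right) \\
&= \frac{B_1\left(M\tau^2 - 8(S+\tau)\right)}{8} + \frac{1}{\sqrt{\lambda}}\left(\frac{2(S+\tau) + \tau^2}{\tau^2}\right) \\
&= \frac{\sqrt{\lambda}\,B_1\left(M\tau^2 - 8(S+\tau)\right) + 8\left(\frac{2(S+\tau) + \tau^2}{\tau^2}\right)}{8\sqrt{\lambda}} \\
&\le \frac{N}{\sqrt{\lambda}},
\end{aligned}$$

where we define $N = \frac{B_1\left(M\tau^2 - 8(S+\tau)\right)}{8} + 1 + 2\left(\frac{S+\tau}{\tau^2}\right)$. 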


< 


87 a 


7a 


—p 

P\l+7 


V2N (2a\ 2 + 2 p M p /2a\ 2 + 2 ? 

7a v^y ~^7A2Vny 


^ 2 ) 7a 

< 

^ 2 ) 7a 


n f 2a\ 2+7 Mn /2aA 2+7 

'^(7 + 7(7 

1 1 n 

/2a A 2 + 2 p Mn /2a A 2 + 2 ? 

^ n j ^17 (7 


iV 29 
TaV re 

iV /20 
TaV re 

iV fw 
TaV re 


Choose Bi > 7 - (2 + 4 (77)) (7 + 25 + 2t) V Note that M = 7 (T + 2(5 + r)) < + 2 + 

4 (~7) = 2,N. Consequently, for Hi > 7 “ (2 + 4 (77)) (7 T 25 + 2r) we have that M < 2N or ^ < 1. 

p 

Note also that (1 + 7 ^ 3, hence: 


paTp 17 

2 ) 7a 


(?)"( 




< 


2 + 2 p N 29 
H —-]= \ — 

7A V re 


7a 


/2a A 2 + 2 p 

6 - + 7 - 

V re y V re 


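Since $A_2(\lambda) \le c\lambda^{\gamma}$ for constants $c > 0$ and $\gamma \in (0,1]$, 

$$\lambda\|f_{D,\lambda}\|_{\mathcal{H}}^2 + R_{L,P}(f_{D,\lambda}) - R^*_{L,P} \le c\lambda^{\gamma} + \frac{N}{\sqrt{\lambda}}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right). \tag{10}$$

We would like to choose a sequence $\lambda_n$ that will minimize the bound in (10). Define 

$$W(\lambda) = c\lambda^{\gamma} + \frac{N}{\sqrt{\lambda}}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right).$$

Differentiating $W$ with respect to $\lambda$ and setting to zero yields: 

$$\begin{aligned}
& c\gamma\lambda^{\gamma-1} - \frac{N}{2}\lambda^{-\frac{3}{2}}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right) = 0 \\
\Rightarrow\; & c\gamma\lambda^{\gamma-1} = \frac{1}{2}N\lambda^{-\frac{3}{2}}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right) \\
\Rightarrow\; & \lambda = \left( \frac{N}{2c\gamma}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right) \right)^{\frac{2}{2\gamma+1}} \\
\Rightarrow\; & \lambda \propto \left( \left(\frac{1}{n}\right)^{\frac{1}{2+2p}} + \left(\frac{1}{n}\right)^{\frac{1}{2}} \right)^{\frac{2}{2\gamma+1}} \\
\Rightarrow\; & \lambda \propto n^{-\frac{1}{(1+p)(2\gamma+1)}}.
\end{aligned}$$

Since the second derivative of $W$ (with respect to $\lambda$) is positive, $\lambda$ is the minimizer. By (10), 

$$\Pr\left( R_{L,P}(f_{D,\lambda}) - R^*_{L,P} \le c\lambda^{\gamma} + \frac{N}{\sqrt{\lambda}}\left( 6\left(\frac{2a}{n}\right)^{\frac{1}{2+2p}} + \sqrt{\frac{2\theta}{n}} \right) \right) \ge 1 - e^{-\theta}. \tag{11}$$

By the choice of $\lambda_n$, the bound in equation (11) can be written as 

$$\begin{aligned}
& c\,n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} + N n^{\frac{1}{2(1+p)(2\gamma+1)}}\left( 6(2a)^{\frac{1}{2+2p}} n^{-\frac{1}{2+2p}} + (2\theta)^{\frac{1}{2}} n^{-\frac{1}{2}} \right) \\
&= c\,n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} + 6N(2a)^{\frac{1}{2+2p}} n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} + N(2\theta)^{\frac{1}{2}} n^{-\frac{2\gamma(1+p)+p}{2(1+p)(2\gamma+1)}} \\
&\le c\,n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} + 6N(2a)^{\frac{1}{2+2p}} n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} + N(2\theta)^{\frac{1}{2}} n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} \\
&= n^{-\frac{\gamma}{(1+p)(2\gamma+1)}}\left( c + 6N(2a)^{\frac{1}{2+2p}} + N(2\theta)^{\frac{1}{2}} \right) \\
&\le Q\left(1 + \sqrt{\theta}\right) n^{-\frac{\gamma}{(1+p)(2\gamma+1)}},
\end{aligned}$$

where $Q$ is a constant that does not depend on $n$ or on $\theta$. 

In conclusion, by choosing a sequence $\lambda_n$ that behaves like $n^{-\frac{1}{(1+p)(2\gamma+1)}}$, we have that the resulting learning 
rate is given by 

$$\Pr\left( R_{L,P}(f_{D,\lambda_n}) - R^*_{L,P} \le Q\left(1 + \sqrt{\theta}\right) n^{-\frac{\gamma}{(1+p)(2\gamma+1)}} \right) \ge 1 - e^{-\theta}.$$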

Case II 

By Theorem 2, 


A \\fD,x\\l + RL,p{fD,x) - Rl,P - A 2 {X) > 


2lo 


X^h,IMIooT))+ 20 3c^£ 


n 


+ -j^ + 4cLe + 2r] 

K 


where rj = ^ ^ with probability not greater than 2e Choose e = (|)^+p (^) 2 + 2 p — 

, (I + 4{S + r)) , > ^ - (6 + 12 (^)) {^ + 4S + 4 t)“\ and define N = + 1 + 2 (^) , then 
as in (10), a very similar calculation shows that 

A ||/ d , a ||^ + RL,p{fD,\) - Rl,p < 

Choose h = KTi 2 / 3+1 as in (3) and choose a such that ln{a) = + ^ln{Ci) — ^ln{n) + i^ln{2f3C2) as in 

Theorem 2. Then by the definition of rj, 



2K‘^ 

V = — 

2K^ 

_2K'^e^~^ {Ci)H2(3C2)^ , 27 ^ 2 ^ 2 ^^ 

-Dina Sin2/3+i 

2K^ (e^ (Ci)5 ( 2 ^ 02 )^ + C2K^) 

< -^^- -■ 

Bin2/3+i 

Hence, 


(a + C2 • h^) 

(^a + C2 ■ n^n 2/3+1 ^ 


N 


A ||/d,a||^ + RL,p{fD,\) — Rl,P — cA'’' + 


<cA^ + ^ 
a/A 


6 ( — 

n 


2a\ 2 + 2 p 


29 

~\~ \ — 

n 




a/A 

2/3+1 


6 ( 

n 


+ 2r/ 


+ 


(Ci)^ (2/3C2)^ 


Hin2/3+i 

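Similarly to Case I, choosing $\lambda_n \propto n^{-\frac{1}{(1+p)(2\gamma+1)}}$ minimizes the last bound (note that the choice of $\lambda_n$ does not 
depend on $\eta$). Hence the resulting learning rate is given by 

$$\Pr\left( D \in (\mathcal{Z}\times\mathcal{Y})^n : R_{L,P}(f_{D,\lambda_n}) - R^*_{L,P} \le Q\left(1 + \sqrt{\theta}\right) n^{-\min\left( \frac{\gamma}{(1+p)(2\gamma+1)},\; \frac{\beta}{2\beta+1} \right)} \right) \ge 1 - e^{-\theta},$$

where $Q$ is a constant that does not depend on $n$ or on $\theta$. □ 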
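The exponents appearing in the two learning rates are simple functions of the approximation exponent γ, the covering-number exponent p and, when g is estimated, the smoothness β of the censoring density. The helper functions below are a hypothetical sketch for exploring these rates; the names `rate_exponent` and `regularization_sequence` are not from the paper, and the proportionality constant in λ_n is set to 1.

```python
def rate_exponent(gamma, p, beta=None):
    """Exponent r such that the excess risk is O(n^(-r)).

    Case I (g known):   r = gamma / ((1 + p) * (2 * gamma + 1)).
    Case II (g unknown): r = min(Case I exponent, beta / (2 * beta + 1)).
    """
    svm_exponent = gamma / ((1.0 + p) * (2.0 * gamma + 1.0))
    if beta is None:
        return svm_exponent
    return min(svm_exponent, beta / (2.0 * beta + 1.0))

def regularization_sequence(n, gamma, p):
    """lambda_n proportional to n^(-1 / ((1 + p) * (2 * gamma + 1)))."""
    return n ** (-1.0 / ((1.0 + p) * (2.0 * gamma + 1.0)))

# Illustrative values only: gamma = 1, p = 0.5, with and without a rough density.
print(rate_exponent(1.0, 0.5))             # g known
print(rate_exponent(1.0, 0.5, beta=0.1))   # g unknown, low smoothness
print(regularization_sequence(800, 1.0, 0.5))
```

When β is large the density estimation term is not the bottleneck and both cases share the same exponent; for small β the term β/(2β+1) dominates, which matches the slower convergence observed in the simulations when g is unknown.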